The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications. However, standard methods for evaluating these metrics have yet to be established. We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics, allowing sensible comparison of their behavior. We demonstrate the effectiveness of our evaluation measures in capturing fundamental characteristics by evaluating them on a collection of classical and state-of-the-art metrics. Our measures reveal that recently developed metrics are becoming better at identifying semantic distributional mismatch, while classical metrics are more sensitive to perturbations at the surface-text level.
The production of machine learning (ML)-based systems requires statistical control throughout their lifecycle. Carefully quantifying business requirements and identifying the key factors that influence them reduces the risk of project failure. The quantification of business requirements leads to the definition of random variables representing the system's key performance indicators, which must then be analyzed through statistical experiments. Furthermore, the available training data and experimental results influence the system's design. Once the system is developed, it is tested and continuously monitored to ensure it meets its business requirements. This is accomplished through the ongoing application of statistical experiments to analyze and control the key performance indicators. This book teaches the art of producing and developing ML-based systems. It advocates an "experiment-first" approach, emphasizing the need to define statistical experiments from the very beginning of the project lifecycle. It also discusses in detail how to apply statistical control to ML-based systems throughout their lifecycle.
Machine learning (ML) techniques generalize, or learn, based on various statistical properties of the training data. A core assumption underlying their theoretical or empirical performance guarantees is that the distribution of the training data is representative of the distribution of the production data. This assumption frequently breaks down; for example, the statistical distribution of the data may change. We term changes that affect ML performance "data drift" or simply "drift". Many classification techniques compute a measure of confidence in their results. This measure may not reflect actual ML performance. A well-known example is a picture of a panda that is correctly classified as such with roughly 60% confidence, but which, when noise is added, is misclassified as a gibbon with more than 99% confidence. However, the work we report here shows that a classifier's confidence measure can be used for the purpose of detecting data drift. We propose an approach, based solely on the classifier's suggested labels and its confidence in them, that warns of changes in the data distribution or feature space that are likely to cause data drift. Our approach identifies degradation in model performance and does not require labeling of production data, which is often lacking or delayed. Our experiments with three different datasets and classifiers demonstrate the effectiveness of this approach in detecting data drift. This is especially encouraging because the classifications themselves may or may not be correct, and no model input data is required. We further explore the statistical approach of sequential change-point tests to automatically determine the amount of data needed to identify drift while controlling the false-positive rate (Type-1 error).
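The label-free detection idea above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's exact procedure: it compares the distribution of top-class confidences on reference data against production data with a two-sample Kolmogorov-Smirnov test, so no production labels are needed. All names and the simulated confidence distributions are illustrative assumptions.

```python
# Hypothetical sketch: flag data drift by comparing confidence distributions
# with a two-sample Kolmogorov-Smirnov test (no production labels required).
import numpy as np

def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def confidence_drift(ref_conf, prod_conf, c_alpha=1.628):
    """Two-sample KS test; c_alpha = 1.628 corresponds to alpha ~ 0.01."""
    n, m = len(ref_conf), len(prod_conf)
    threshold = c_alpha * np.sqrt((n + m) / (n * m))
    return bool(ks_statistic(ref_conf, prod_conf) > threshold)

# Simulated top-class confidences (illustrative only):
rng = np.random.default_rng(0)
ref = rng.beta(8, 2, size=2000)      # confident on in-distribution data
shifted = rng.beta(4, 3, size=2000)  # confidences degrade under drift
print(confidence_drift(ref, shifted))     # drift flagged
print(confidence_drift(ref, ref[:1000]))  # same source: no drift
```

Note that the detector sees only confidence scores, matching the abstract's claim that neither labels nor model inputs are required in production.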
The API economy refers to the widespread integration of APIs (application programming interfaces) and microservices, through which software applications communicate with one another, as an essential element of business models and functions. The number of possible ways such a system can be used is huge. It is therefore desirable to monitor usage patterns and identify when the system is being used in a way it has never been used before. This provides a warning to system analysts, who can then ensure uninterrupted operation of the system. In this work we analyze both histograms and call graphs of API usage to determine whether the system's usage patterns have shifted. We compare the application of nonparametric statistical tests and Bayesian sequential analysis to the problem. This is done in a way that overcomes the problem of repeated statistical testing and ensures the statistical significance of the alerts. The technique was simulated and tested, and proved effective in detecting drift in a variety of scenarios. We also describe modifications to the technique that reduce its memory requirements, so that it can respond more quickly when the distribution drift occurs soon after monitoring begins.
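As a concrete illustration of histogram-based shift detection, here is a minimal sketch, not the paper's method: it compares a baseline histogram of per-endpoint call counts against a recent window using a Pearson chi-square goodness-of-fit test. The endpoint names, counts, and critical values are illustrative assumptions.

```python
# Hypothetical sketch: flag API usage-pattern shifts by testing a recent
# endpoint-call histogram against a baseline with a Pearson chi-square test.
import math

CHI2_CRIT = {1: 6.635, 2: 9.210, 3: 11.345, 4: 13.277, 5: 15.086}  # alpha = 0.01

def usage_shift(baseline_counts, recent_counts, crit=CHI2_CRIT):
    """baseline/recent: dicts mapping endpoint -> call count over a window."""
    endpoints = sorted(set(baseline_counts) | set(recent_counts))
    n_base = sum(baseline_counts.values())
    n_recent = sum(recent_counts.values())
    stat = 0.0
    for e in endpoints:
        expected = n_recent * baseline_counts.get(e, 0) / n_base
        observed = recent_counts.get(e, 0)
        if expected == 0:  # endpoint never seen before: an immediate shift
            return True, math.inf
        stat += (observed - expected) ** 2 / expected
    df = len(endpoints) - 1
    return stat > crit[df], stat

baseline = {"/login": 500, "/search": 300, "/cart": 200}
drifted  = {"/login": 200, "/search": 100, "/cart": 700}
steady   = {"/login": 505, "/search": 290, "/cart": 205}
print(usage_shift(baseline, drifted)[0], usage_shift(baseline, steady)[0])
```

A production version would still need the repeated-testing correction the abstract emphasizes, since running this test on every new window inflates the false-alarm rate.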
Consider a dataset of structured features, such as $\{\textrm{sex}, \textrm{income}, \textrm{race}, \textrm{experience}\}$. A user may want to know where observations are concentrated in the feature space, and where it is sparse or empty. The presence of large sparse or empty regions can provide domain knowledge of soft or hard feature constraints (e.g., what the typical income range is, or that a high income may be unlikely with few years of work experience). Furthermore, these regions can suggest to the user that machine learning (ML) model predictions for data inputs falling in sparse or empty regions may be unreliable. An interpretable region is a hyper-rectangle, such as $\{\textrm{race} \in \{\textrm{Black}, \textrm{White}\}\}\ \&\ \{10 \leq \textrm{experience} \leq 13\}$, containing all observations satisfying the constraints; typically, such regions are defined by a small number of features. Our method constructs an observation-density-based partition of the feature space observed in the dataset. It has several advantages over other methods in that it works on features of mixed type (numeric or categorical) in the original domain, and can also separate out empty regions. As visualizations show, the resulting partitions accord with the spatial groupings that a human eye might identify; the results should therefore extend to higher dimensions. We also show some applications of the partitions to other data analysis tasks, such as inferring ML model error, measuring high-dimensional density variability, and causal inference of treatment effects. Many of these applications are made possible by the hyper-rectangular form of the partitioned regions.
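The interpretable-region idea can be sketched directly: a hyper-rectangle over mixed-type features is just a conjunction of categorical membership sets and numeric intervals. This is a minimal illustration with made-up rows, not the paper's partitioning algorithm.

```python
# Hypothetical sketch: represent an interpretable hyper-rectangle region over
# mixed-type features and select the observations that satisfy it.
def in_region(row, region):
    """region maps feature -> set of allowed categories, or (low, high) interval."""
    for feat, constraint in region.items():
        val = row[feat]
        if isinstance(constraint, set):
            if val not in constraint:       # categorical membership constraint
                return False
        else:
            low, high = constraint
            if not (low <= val <= high):    # numeric interval constraint
                return False
    return True

data = [
    {"sex": "F", "race": "Black", "experience": 11, "income": 52_000},
    {"sex": "M", "race": "White", "experience": 12, "income": 61_000},
    {"sex": "F", "race": "Asian", "experience": 3,  "income": 45_000},
]
# The region from the abstract: race in {Black, White} & 10 <= experience <= 13
region = {"race": {"Black", "White"}, "experience": (10, 13)}
matches = [r for r in data if in_region(r, region)]
print(len(matches))  # the first two rows satisfy both constraints
```

Note the region is defined by only two of the four features, matching the abstract's point that interpretable regions typically involve few features.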
A trained ML model is deployed on another "test" dataset where the values of the target feature (labels) are unknown. Drift is a change in distribution between the training and deployment data, and is a concern for whether model performance has changed. For example, for a cat/dog image classifier, drift during deployment could take the form of rabbit images (a new class) or cat/dog images with changed characteristics (a change in distribution). We wish to detect these changes, but without deployment data labels we cannot measure accuracy. Instead, we detect drift indirectly by nonparametrically testing the distribution of the model's prediction confidence for changes. This generalizes our method and sidesteps domain-specific feature representations. We address important statistical issues, particularly Type-1 error control in sequential testing, using change-point models (CPMs; see Adams and Ross 2012). We also use nonparametric outlier methods to surface suspicious observations to the user for model diagnosis, since the confidence distributions before and after a change overlap significantly. In experiments demonstrating robustness, we train on a subset of MNIST digit classes, then insert drift (e.g., an unseen digit class) into the deployment data under various settings (gradual/sudden changes in the drift proportion). A new loss function is introduced to compare the performance (detection delay, Type-1 and Type-2 errors) of drift detectors under different levels of drift-class contamination.
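The sequential flavor of this detection can be illustrated with a deliberately simplified monitor, an assumption-laden stand-in for a proper CPM: it watches a stream of prediction confidences and raises an alarm the first time a sliding window's mean departs implausibly far from the reference mean. The stream parameters and thresholds are illustrative.

```python
# Hypothetical sketch (a simplification of CPM-style sequential testing):
# flag the first index at which a sliding window of confidences deviates
# from the reference distribution's mean by more than z_crit standard errors.
import random
import statistics

def detect_change(stream, ref_size=200, window=50, z_crit=4.0):
    """Return the index at which drift is flagged, or None."""
    ref = stream[:ref_size]
    mu, sigma = statistics.mean(ref), statistics.stdev(ref)
    se = sigma / window ** 0.5                   # std. error of a window mean
    for i in range(ref_size + window, len(stream) + 1):
        window_mean = statistics.mean(stream[i - window:i])
        if abs(window_mean - mu) > z_crit * se:  # window mean is implausible
            return i
    return None

random.seed(1)
calm = [random.gauss(0.90, 0.05) for _ in range(400)]
drop = [random.gauss(0.60, 0.05) for _ in range(200)]  # drift begins at index 400
alarm = detect_change(calm + drop)
print(alarm)  # alarm fires shortly after index 400
```

A real CPM additionally recomputes the test at every step with thresholds calibrated so the Type-1 error is controlled over the whole monitoring run, which is precisely the issue the abstract highlights.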
Recent advances in deep learning have enabled us to address the curse of dimensionality (COD) by solving problems in higher dimensions. A subset of such approaches to addressing the COD has led us to solving high-dimensional PDEs. This has opened doors to solving a variety of real-world problems ranging from mathematical finance to stochastic control for industrial applications. Although feasible, these deep learning methods are still constrained by training time and memory. Tackling these shortcomings, Tensor Neural Networks (TNN) demonstrate that they can provide significant parameter savings while attaining the same accuracy as the classical Dense Neural Network (DNN). We also show how TNN can be trained faster than DNN for the same accuracy. Beyond TNN, we introduce Tensor Network Initializer (TNN Init), a weight initialization scheme that leads to faster convergence with smaller variance for an equivalent parameter count compared to a DNN. We benchmark TNN and TNN Init by applying them to solve the parabolic PDE associated with the Heston model, which is widely used in financial pricing theory.
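The parameter-savings claim can be made concrete with a toy low-rank factorization, which is not the paper's TNN construction but shows the same accounting: replacing a dense weight matrix with a product of two thin factors trades a small rank hyperparameter for a large reduction in trainable parameters while producing the same-shaped output. The sizes and rank below are illustrative assumptions.

```python
# Hypothetical illustration (not the paper's exact TNN architecture):
# parameter counts for a dense layer vs. a rank-r factorized layer W ~= A @ B.
import numpy as np

def dense_params(m, n):
    return m * n + n               # full weight matrix + bias

def factored_params(m, n, r):
    return m * r + r * n + n       # A: m x r, B: r x n, + bias

m, n, r = 1024, 1024, 16
print(dense_params(m, n), factored_params(m, n, r))  # ~31x fewer parameters

# The factored form computes the same-shaped output with two small matmuls:
rng = np.random.default_rng(0)
A, B = rng.normal(size=(m, r)), rng.normal(size=(r, n))
x = rng.normal(size=(m,))
assert np.allclose(x @ (A @ B), (x @ A) @ B)  # associativity: same result
```

The speedup intuition is similar: the factored forward pass costs O(mr + rn) multiply-adds instead of O(mn), which is why parameter-efficient tensorized layers can also train faster at comparable accuracy.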
Managing novelty in perception-based human activity recognition (HAR) is critical in realistic settings to improve task performance over time and ensure solution generalization outside of prior seen samples. Novelty manifests in HAR as unseen samples, activities, objects, environments, and sensor changes, among other ways. Novelty may be task-relevant, such as a new class or new features, or task-irrelevant, resulting in nuisance novelty such as never before seen noise, blur, or distorted video recordings. To perform HAR optimally, algorithmic solutions must be tolerant to nuisance novelty, and learn over time in the face of novelty. This paper 1) formalizes the definition of novelty in HAR, building upon the prior definition of novelty in classification tasks, 2) proposes an incremental open world learning (OWL) protocol and applies it to the Kinetics datasets to generate a new benchmark, KOWL-718, 3) analyzes the performance of current state-of-the-art HAR models when novelty is introduced over time, and 4) provides a containerized and packaged pipeline for reproducing the OWL protocol and for adapting it to any future updates to Kinetics. The experimental analysis includes an ablation study of how the different models perform under various conditions as annotated by Kinetics-AVA. The protocol as an algorithm for reproducing experiments using the KOWL-718 benchmark will be publicly released with code and containers at https://github.com/prijatelj/human-activity-recognition-in-an-open-world. The code may be used to analyze different annotations and subsets of the Kinetics datasets in an incremental open world fashion, as well as be extended as further updates to Kinetics are released.
Quantum computing (QC) promises significant advantages over classical computers on certain hard computational tasks. However, current quantum hardware, also known as noisy intermediate-scale quantum (NISQ) devices, is still unable to carry out computations faithfully, mainly because of the lack of quantum error correction (QEC) capability. A significant body of theoretical work has provided various types of QEC codes; one of the notable topological codes is the surface code, whose features, such as the requirement of only nearest-neighbor two-qubit control gates and a large error threshold, make it a leading candidate for scalable quantum computation. Recently developed machine learning (ML) techniques, especially reinforcement learning (RL) methods, have been applied to the decoding problem and have already made certain progress. Nevertheless, the device noise pattern may change over time, making trained decoder models ineffective. In this paper, we propose a continual reinforcement learning method to address these decoding challenges. Specifically, we implement a double deep Q-learning with probabilistic policy reuse (DDQN-PPR) model to learn surface code decoding strategies for quantum environments with varying noise patterns. Through numerical simulations, we show that the proposed DDQN-PPR model can significantly reduce the computational complexity. Moreover, increasing the number of trained policies can further improve the agent's performance. Our results open a path toward building more capable RL agents that can leverage previously gained knowledge to tackle QEC challenges.
Naturally-occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers to information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical when question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)$^2$ (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally-occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)$^2$, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. We find that current models do struggle with handling questionable assumptions -- the best performing model achieves 59% human rater acceptability on abstractive QA with (QA)$^2$ questions, leaving substantial headroom for progress.